Executive Summary

This population study examined employee retention in a North American acute care hospital over a 10-year period to determine which factors influence employee turnover. The hospital was interested in building a model to predict which employees had a higher likelihood of leaving the organization, so that timely interventions could be made to improve employee retention.

The main questions that needed to be addressed were -

  1. What factors affect voluntary employee turnover?

  2. How can employee turnover be predicted?

  3. How can voluntary, regrettable, avoidable employee turnover be reduced?

HR Analytical Model

HR experts were consulted, and HR theories such as Human Capital theory, Social Identity theory, Expectancy theory and Distributive Justice theory were considered while carefully selecting the factors that may affect employee turnover. The goal of the project was to build an analytical model simple enough for HR executives to understand, yet actionable - one that provides clear insights and assists in developing a retention campaign.

Employee termination was classified into the following types -

The termination type of interest was VTO-R-A - Voluntary turnover - Regrettable - Avoidable

Factors selected for analysis were grouped into 3 main categories -

Socio-Demographic variables -

Work variables -

Psychometric variables -

Other factors that were important but were left out, due to a lack of reliable data or the client's reluctance to include them in the model -

Exploratory Data Analysis

2000 employee records spanning a 10-year period were selected for model building. The records consisted mainly of voluntary, regrettable and avoidable turnover cases, plus a random selection of active employees. A summary and sample of the data can be seen below -

library(rpart.plot)
## Loading required package: rpart
library(rpart)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(car)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
setwd("~/Documents/BI-Dashboards")
training <- read.csv("HRAnalytics_Model.csv")

dim(training)
## [1] 2000   29
summary(training)
##  EmployeePPPER       Status        ZipCode    Sex            Eeoc     
##  Min.   :   1.0   ACTIVE: 900   75028  :317   F:1679   ASIAN   :   9  
##  1st Qu.: 503.8   TERM  :1100   75032  :210   M: 321   BLACK   :   6  
##  Median :1007.5                 75018  :208            HISPANIC:   4  
##  Mean   :1018.2                 75022  :207            WHITE   :1981  
##  3rd Qu.:1549.2                 75042  :207                           
##  Max.   :2052.0                 75012  :105                           
##                                 (Other):746                           
##      Shift          BaseRate          Grade          Training   
##  Min.   :1.000   Min.   :  0.00   Min.   : 0.00   Min.   :   0  
##  1st Qu.:1.000   1st Qu.:  8.14   1st Qu.: 3.75   1st Qu.:3750  
##  Median :1.000   Median : 11.73   Median : 6.00   Median :5000  
##  Mean   :1.266   Mean   : 15.09   Mean   :18.48   Mean   :4074  
##  3rd Qu.:1.000   3rd Qu.: 17.35   3rd Qu.:37.00   3rd Qu.:5000  
##  Max.   :3.000   Max.   :185.10   Max.   :99.00   Max.   :5000  
##                                                                 
##     LastEval        StartEval        Education        StartGrade   
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 3.000   1st Qu.: 1.000   1st Qu.: 0.000   1st Qu.: 2.00  
##  Median : 5.000   Median : 3.000   Median : 1.000   Median : 6.00  
##  Mean   : 5.171   Mean   : 3.192   Mean   : 2.026   Mean   :15.96  
##  3rd Qu.: 7.000   3rd Qu.: 5.000   3rd Qu.: 4.000   3rd Qu.:29.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :95.00  
##                                                                    
##    StartRate        Experience    EngagementIndex  CommitmentIndex 
##  Min.   :  0.04   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:  6.00   1st Qu.: 0.00   1st Qu.: 3.000   1st Qu.: 4.000  
##  Median :  6.00   Median : 7.00   Median : 5.000   Median : 5.000  
##  Mean   : 11.34   Mean   :10.63   Mean   : 5.121   Mean   : 5.295  
##  3rd Qu.: 12.51   3rd Qu.:18.00   3rd Qu.: 8.000   3rd Qu.: 7.000  
##  Max.   :177.93   Max.   :57.00   Max.   :10.000   Max.   :10.000  
##                                                                    
##   Unscheduled         Tenure            Age           EntryAge    
##  Min.   : 0.000   Min.   : 0.000   Min.   :16.00   Min.   :14.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.:30.00   1st Qu.:21.00  
##  Median : 5.000   Median : 4.000   Median :41.00   Median :28.00  
##  Mean   : 4.896   Mean   : 7.868   Mean   :41.75   Mean   :30.94  
##  3rd Qu.: 7.000   3rd Qu.:11.000   3rd Qu.:52.00   3rd Qu.:39.00  
##  Max.   :10.000   Max.   :52.000   Max.   :85.00   Max.   :78.00  
##                                                                   
##    CareerAge      CareerStage    ServiceLength     SalaryProg    
##  Min.   :15.00   Min.   :1.000   Min.   :1.000   Min.   :-1.750  
##  1st Qu.:26.00   1st Qu.:2.000   1st Qu.:2.000   1st Qu.: 0.220  
##  Median :38.00   Median :3.000   Median :3.000   Median : 0.540  
##  Mean   :38.84   Mean   :2.614   Mean   :3.419   Mean   : 1.246  
##  3rd Qu.:50.00   3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.: 1.452  
##  Max.   :83.00   Max.   :4.000   Max.   :6.000   Max.   : 9.950  
##                                                                  
##   JobLevelProg        AvgEval       LastPromotion    TotalEmployment
##  Min.   :-8.0000   Min.   : 0.000   Min.   :-2.000   Min.   : 0.0   
##  1st Qu.: 0.0000   1st Qu.: 3.000   1st Qu.: 1.000   1st Qu.: 6.0   
##  Median : 0.3300   Median : 4.000   Median : 2.000   Median :17.0   
##  Mean   : 0.6804   Mean   : 4.182   Mean   : 4.369   Mean   :18.5   
##  3rd Qu.: 1.2900   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.:29.0   
##  Max.   :10.0000   Max.   :10.000   Max.   :51.000   Max.   :61.0   
## 
head(training)
##   EmployeePPPER Status ZipCode Sex  Eeoc Shift BaseRate Grade Training
## 1             1   TERM   75018   F WHITE     1    18.47    45     5000
## 2             2 ACTIVE   75002   F WHITE     1    17.35    25     5000
## 3             3 ACTIVE   75018   F WHITE     2    24.95    55     5000
## 4             4 ACTIVE   75062   F WHITE     3    17.35    25     5000
## 5             5 ACTIVE   75020   F WHITE     1    11.70     6     5000
## 6             6   TERM   75042   M WHITE     1    22.43    58     5000
##   LastEval StartEval Education StartGrade StartRate Experience
## 1        1         2         5         40     15.66          0
## 2        4         0         3         21     14.44          1
## 3        2         3         6         48     22.13          5
## 4        4         1         3         19     11.65          3
## 5        0         5         1          5      6.00          0
## 6        1         4         6         57     17.51          4
##   EngagementIndex CommitmentIndex Unscheduled Tenure Age EntryAge
## 1               2               4           7     17  48       21
## 2              10               9           4     14  36       22
## 3               0               3          10     29  54       26
## 4               3               6           5     12  36       24
## 5               8               6           7     29  44       15
## 6               8               6           0      2  37       25
##   CareerAge CareerStage ServiceLength SalaryProg JobLevelProg AvgEval
## 1        38           3             5       0.17         0.29     1.5
## 2        36           3             5       0.21         0.29     2.0
## 3        54           4             6       0.10         0.24     2.5
## 4        36           3             5       0.48         0.50     2.5
## 5        44           3             6       0.20         0.03     2.5
## 6        27           2             3       2.46         0.50     2.5
##   LastPromotion TotalEmployment
## 1            17              17
## 2             2              15
## 3            11              34
## 4            12              15
## 5            11              29
## 6             2               6

The main variables were grouped by category and scatter plots were constructed to analyse relationships among variables.

scatterplotMatrix(~Shift+BaseRate+Grade+LastEval+LastPromotion+Age|Status,data=training)

scatterplotMatrix(~Education+StartGrade+StartRate+StartEval+Experience|Status,data=training)

scatterplotMatrix(~SalaryProg+JobLevelProg+AvgEval+Tenure+CareerAge|Status,data=training)

scatterplotMatrix(~EngagementIndex+CommitmentIndex+Unscheduled|Status,data=training)

The scatterplots show the relationships among the variables and their impact on Status - active vs. terminated employees.

Predictive Analysis

Since this is a binary classification problem - active vs. terminated employees - a decision tree is a natural method for constructing a prediction model.

We divide the data set into training and validation sets. The training set is then further divided into training and test sets. Various models are trained on the training set and evaluated on the test set, and the best model is then validated against the held-out validation set and measured for accuracy.

partition <- createDataPartition(y = training$Status, p = 0.8, list = FALSE)
testing <- training[-partition, ]   # hold out the validation set before overwriting training
training <- training[partition, ]

partition <- createDataPartition(y = training$Status, p = 0.6, list = FALSE)
traindata <- training[partition, ]
testdata <- training[-partition, ]

RPart

After partitioning the data sets, the rpart method is used to construct a decision tree and check its accuracy.

tree <- rpart(Status ~ ZipCode+Sex+Eeoc+Shift+BaseRate+Grade+Training+LastEval+StartEval+Education+StartGrade+StartRate+Experience+EngagementIndex+CommitmentIndex+Unscheduled+Tenure+Age+EntryAge+CareerAge+CareerStage+ServiceLength+SalaryProg+JobLevelProg+AvgEval+LastPromotion+TotalEmployment, data=traindata)
tree
## n= 960 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 960 432 TERM (0.45000000 0.55000000)  
##     2) LastPromotion>=0.5 798 366 ACTIVE (0.54135338 0.45864662)  
##       4) BaseRate>=9.205 601 222 ACTIVE (0.63061564 0.36938436)  
##         8) Age< 62.5 536 179 ACTIVE (0.66604478 0.33395522)  
##          16) Tenure>=3.5 397 106 ACTIVE (0.73299748 0.26700252)  
##            32) CommitmentIndex< 5.5 212  39 ACTIVE (0.81603774 0.18396226)  
##              64) CareerAge>=49.5 79   6 ACTIVE (0.92405063 0.07594937) *
##              65) CareerAge< 49.5 133  33 ACTIVE (0.75187970 0.24812030)  
##               130) Age< 49.5 122  22 ACTIVE (0.81967213 0.18032787) *
##               131) Age>=49.5 11   0 TERM (0.00000000 1.00000000) *
##            33) CommitmentIndex>=5.5 185  67 ACTIVE (0.63783784 0.36216216)  
##              66) ZipCode=75012,75020,75028,75032,75038,NULL 71  11 ACTIVE (0.84507042 0.15492958) *
##              67) ZipCode=75001,75002,75011,75018,75022,75042,75055,75062 114  56 ACTIVE (0.50877193 0.49122807)  
##               134) Shift< 1.5 97  42 ACTIVE (0.56701031 0.43298969)  
##                 268) EngagementIndex< 5.5 28   6 ACTIVE (0.78571429 0.21428571) *
##                 269) EngagementIndex>=5.5 69  33 TERM (0.47826087 0.52173913)  
##                   538) JobLevelProg< 0.175 19   5 ACTIVE (0.73684211 0.26315789) *
##                   539) JobLevelProg>=0.175 50  19 TERM (0.38000000 0.62000000) *
##               135) Shift>=1.5 17   3 TERM (0.17647059 0.82352941) *
##          17) Tenure< 3.5 139  66 TERM (0.47482014 0.52517986)  
##            34) Experience>=28.5 19   3 ACTIVE (0.84210526 0.15789474) *
##            35) Experience< 28.5 120  50 TERM (0.41666667 0.58333333)  
##              70) Age< 31.5 43  17 ACTIVE (0.60465116 0.39534884)  
##               140) ZipCode=75011,75012,75018,75022,75028,75032,75042,75055,75062 36  10 ACTIVE (0.72222222 0.27777778) *
##               141) ZipCode=75001,75002,75038,NULL 7   0 TERM (0.00000000 1.00000000) *
##              71) Age>=31.5 77  24 TERM (0.31168831 0.68831169) *
##         9) Age>=62.5 65  22 TERM (0.33846154 0.66153846)  
##          18) CareerAge>=62.5 44  22 ACTIVE (0.50000000 0.50000000)  
##            36) Age< 68.5 16   2 ACTIVE (0.87500000 0.12500000) *
##            37) Age>=68.5 28   8 TERM (0.28571429 0.71428571)  
##              74) CareerAge>=70.5 7   1 ACTIVE (0.85714286 0.14285714) *
##              75) CareerAge< 70.5 21   2 TERM (0.09523810 0.90476190) *
##          19) CareerAge< 62.5 21   0 TERM (0.00000000 1.00000000) *
##       5) BaseRate< 9.205 197  53 TERM (0.26903553 0.73096447)  
##        10) Age< 19.5 18   2 ACTIVE (0.88888889 0.11111111) *
##        11) Age>=19.5 179  37 TERM (0.20670391 0.79329609)  
##          22) CareerAge>=19.5 136  37 TERM (0.27205882 0.72794118)  
##            44) Age< 22.5 17   4 ACTIVE (0.76470588 0.23529412) *
##            45) Age>=22.5 119  24 TERM (0.20168067 0.79831933) *
##          23) CareerAge< 19.5 43   0 TERM (0.00000000 1.00000000) *
##     3) LastPromotion< 0.5 162   0 TERM (0.00000000 1.00000000) *
printcp(tree)
## 
## Classification tree:
## rpart(formula = Status ~ ZipCode + Sex + Eeoc + Shift + BaseRate + 
##     Grade + Training + LastEval + StartEval + Education + StartGrade + 
##     StartRate + Experience + EngagementIndex + CommitmentIndex + 
##     Unscheduled + Tenure + Age + EntryAge + CareerAge + CareerStage + 
##     ServiceLength + SalaryProg + JobLevelProg + AvgEval + LastPromotion + 
##     TotalEmployment, data = traindata)
## 
## Variables actually used in tree construction:
##  [1] Age             BaseRate        CareerAge       CommitmentIndex
##  [5] EngagementIndex Experience      JobLevelProg    LastPromotion  
##  [9] Shift           Tenure          ZipCode        
## 
## Root node error: 432/960 = 0.45
## 
## n= 960 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.181713      0   1.00000 1.00000 0.035681
## 2  0.048611      2   0.63657 0.64815 0.032600
## 3  0.032407      3   0.58796 0.63194 0.032355
## 4  0.023148      4   0.55556 0.58796 0.031637
## 5  0.020833      6   0.50926 0.55787 0.031100
## 6  0.016204      7   0.48843 0.55093 0.030970
## 7  0.013889      8   0.47222 0.54167 0.030793
## 8  0.011574     10   0.44444 0.52778 0.030521
## 9  0.010417     11   0.43287 0.53704 0.030704
## 10 0.010185     13   0.41204 0.53704 0.030704
## 11 0.010000     20   0.33333 0.52546 0.030475
prp(tree)

As printcp shows, the best cross-validated relative error is about 0.53 against a root node error of 45%, i.e. a misclassification rate of roughly 24% (0.525 x 0.45). This is not accurate enough for our prediction needs.
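Before moving on from rpart, one option is to prune the tree at the complexity parameter with the lowest cross-validated error from the cp table. A minimal sketch, assuming the `tree` object fitted above:

```r
# Pick the cp value with the minimum cross-validated error (xerror)
# from the cp table, then prune the tree back to that size.
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned.tree <- prune(tree, cp = best.cp)
prp(pruned.tree)
```

Pruning trades a little training-set fit for a simpler tree that generalizes better, though here the cross-validated error would still be well above our target.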

RandomForest

Next the randomForest method is applied. Random forests are among the best-performing tree-based classification models in use today.

model.rf<-randomForest(traindata[,3:29],traindata$Status)
model.rf
## 
## Call:
##  randomForest(x = traindata[, 3:29], y = traindata$Status) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 18.54%
## Confusion matrix:
##        ACTIVE TERM class.error
## ACTIVE    354   78   0.1805556
## TERM      100  428   0.1893939
training_pred <- predict(model.rf, traindata)
confusionMatrix(training_pred, traindata$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ACTIVE TERM
##     ACTIVE    405   52
##     TERM       27  476
##                                           
##                Accuracy : 0.9177          
##                  95% CI : (0.8985, 0.9343)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8346          
##  Mcnemar's Test P-Value : 0.00693         
##                                           
##             Sensitivity : 0.9375          
##             Specificity : 0.9015          
##          Pos Pred Value : 0.8862          
##          Neg Pred Value : 0.9463          
##              Prevalence : 0.4500          
##          Detection Rate : 0.4219          
##    Detection Prevalence : 0.4760          
##       Balanced Accuracy : 0.9195          
##                                           
##        'Positive' Class : ACTIVE          
## 
testdata_pred <- predict(model.rf, testdata)
confusionMatrix(testdata_pred, testdata$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ACTIVE TERM
##     ACTIVE    232   66
##     TERM       56  286
##                                           
##                Accuracy : 0.8094          
##                  95% CI : (0.7768, 0.8391)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6161          
##  Mcnemar's Test P-Value : 0.4152          
##                                           
##             Sensitivity : 0.8056          
##             Specificity : 0.8125          
##          Pos Pred Value : 0.7785          
##          Neg Pred Value : 0.8363          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3625          
##    Detection Prevalence : 0.4656          
##       Balanced Accuracy : 0.8090          
##                                           
##        'Positive' Class : ACTIVE          
## 
varImpPlot(model.rf,sort=TRUE,n.var=27)

As can be seen, the training-set accuracy is about 92%, the out-of-bag error estimate is 18.5%, and the test-set accuracy is 81%. This is better than the previous result, but still not quite up to par for our requirements.
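The OOB error itself can also guide tuning of mtry, the number of variables tried at each split. A hedged sketch using randomForest's tuneRF helper, assuming the traindata frame above (the step factor and improvement threshold are illustrative choices):

```r
# Step mtry up and down from the default, keeping the value that
# improves the out-of-bag error by at least 1%.
set.seed(42)
tuned <- tuneRF(traindata[, 3:29], traindata$Status,
                stepFactor = 1.5, improve = 0.01, ntreeTry = 500)
```

This is a cheaper alternative to resampling-based tuning, since it reuses the OOB estimate rather than refitting on held-out folds.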

RandomForest with Cross-Validation

Next, randomForest with 10-fold cross-validation (via caret's trainControl) is applied.

model <- train(traindata[,3:29],traindata$Status, method = "rf", prox = TRUE, trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE))
training_pred <- predict(model, traindata)
confusionMatrix(training_pred, traindata$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ACTIVE TERM
##     ACTIVE    432    0
##     TERM        0  528
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9962, 1)
##     No Information Rate : 0.55       
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00       
##             Specificity : 1.00       
##          Pos Pred Value : 1.00       
##          Neg Pred Value : 1.00       
##              Prevalence : 0.45       
##          Detection Rate : 0.45       
##    Detection Prevalence : 0.45       
##       Balanced Accuracy : 1.00       
##                                      
##        'Positive' Class : ACTIVE     
## 
testdata_pred <- predict(model, testdata)
confusionMatrix(testdata_pred, testdata$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ACTIVE TERM
##     ACTIVE    235   66
##     TERM       53  286
##                                           
##                Accuracy : 0.8141          
##                  95% CI : (0.7817, 0.8435)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6259          
##  Mcnemar's Test P-Value : 0.2713          
##                                           
##             Sensitivity : 0.8160          
##             Specificity : 0.8125          
##          Pos Pred Value : 0.7807          
##          Neg Pred Value : 0.8437          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3672          
##    Detection Prevalence : 0.4703          
##       Balanced Accuracy : 0.8142          
##                                           
##        'Positive' Class : ACTIVE          
## 
varImp(model)
## rf variable importance
## 
##   only 20 most important variables shown (out of 27)
## 
##                 Overall
## Age              100.00
## BaseRate          98.05
## LastPromotion     87.23
## Tenure            73.54
## ZipCode           69.43
## CareerAge         33.63
## SalaryProg        32.41
## ServiceLength     28.35
## TotalEmployment   28.24
## StartRate         27.09
## StartGrade        24.14
## JobLevelProg      23.37
## Grade             19.89
## EntryAge          19.13
## EngagementIndex   18.90
## CommitmentIndex   18.13
## Unscheduled       16.85
## AvgEval           15.82
## Experience        15.41
## LastEval          14.30

Finally, we apply the model to the held-out validation set to test our prediction. The confusion matrix, along with a list of employees that were incorrectly predicted, is included for further analysis.

testing_pred <- predict(model, testing)
confusionMatrix(testing_pred, testing$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ACTIVE TERM
##     ACTIVE    144   14
##     TERM       12  163
##                                           
##                Accuracy : 0.9219          
##                  95% CI : (0.8877, 0.9484)
##     No Information Rate : 0.5315          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8433          
##  Mcnemar's Test P-Value : 0.8445          
##                                           
##             Sensitivity : 0.9231          
##             Specificity : 0.9209          
##          Pos Pred Value : 0.9114          
##          Neg Pred Value : 0.9314          
##              Prevalence : 0.4685          
##          Detection Rate : 0.4324          
##    Detection Prevalence : 0.4745          
##       Balanced Accuracy : 0.9220          
##                                           
##        'Positive' Class : ACTIVE          
## 
testing.pred<-cbind(testing,testing_pred)
subset(testing.pred, Status!=testing_pred, select=c("EmployeePPPER","Status","testing_pred"))
##      EmployeePPPER Status testing_pred
## 18              18   TERM       ACTIVE
## 119            118   TERM       ACTIVE
## 377            378   TERM       ACTIVE
## 405            407 ACTIVE         TERM
## 424            426   TERM       ACTIVE
## 719            724   TERM       ACTIVE
## 724            729   TERM       ACTIVE
## 742            749   TERM       ACTIVE
## 808            815   TERM       ACTIVE
## 843            850 ACTIVE         TERM
## 854            861   TERM       ACTIVE
## 866            873 ACTIVE         TERM
## 896            903   TERM       ACTIVE
## 1036          1043   TERM       ACTIVE
## 1090          1097   TERM       ACTIVE
## 1167          1174 ACTIVE         TERM
## 1349          1358 ACTIVE         TERM
## 1519          1568 ACTIVE         TERM
## 1543          1592 ACTIVE         TERM
## 1719          1769 ACTIVE         TERM
## 1763          1814   TERM       ACTIVE
## 1811          1863   TERM       ACTIVE
## 1828          1880 ACTIVE         TERM
## 1928          1980 ACTIVE         TERM
## 1995          2047 ACTIVE         TERM
## 2000          2052 ACTIVE         TERM

We obtain 92% accuracy on the validation data set, which is a good prediction rate.
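Since the stated goal is timely intervention, the validated model can also be used to rank currently active employees by predicted termination risk. An illustrative sketch - the use of class probabilities and the chosen columns are assumptions, not part of the original analysis:

```r
# Score active employees with the predicted probability of the TERM
# class and list the highest-risk cases for the retention campaign.
active <- subset(training, Status == "ACTIVE")
active$TermProb <- predict(model, active, type = "prob")[, "TERM"]
head(active[order(-active$TermProb), c("EmployeePPPER", "TermProb")], 10)
```

Ranking by probability, rather than the hard ACTIVE/TERM label, lets HR prioritize outreach within a limited retention budget.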

As can be seen from both randomForest models, the top 5 factors affecting terminations are Age, BaseRate, LastPromotion, Tenure and ZipCode.

Let's plot these 5 variables against Status and check their relationships again.

scatterplotMatrix(~Status+BaseRate+Age+LastPromotion+ZipCode+Tenure, data=testing)

Conclusion

From the above matrix it can be concluded that younger employees (40 years or below) with lower base rates (under $20/hr), who have not received a promotion in the last 5 years, who commute long distances to work, and who have worked for the organization 5 years or less, tend to leave the organization.